Indexing Using Both N-Grams and Words
Abstract
Goals. The Johns Hopkins University Applied Physics Laboratory (JHU/APL) is a first-time entrant in the TREC Category A evaluation. Our information retrieval research focuses on the relative value of, and interaction among, multiple term types. In particular, we are interested in examining both words and n-grams as indexing terms. The relative value of words and n-grams has been disputed; to our knowledge, however, no one has previously studied their relative merits while holding all other aspects of the system constant.
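As an illustration of what treating both words and n-grams as indexing terms can look like, the following minimal Python sketch extracts both term types from the same text. It is not the JHU/APL system: the function names, the tokenization rule, the choice of overlapping character n-grams, and the n-gram length of 5 are all assumptions made for the example.

```python
import re
from collections import Counter


def word_terms(text):
    """Return lowercased alphanumeric word tokens (hypothetical tokenizer)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def char_ngrams(text, n=5):
    """Return overlapping character n-grams of the lowercased text.

    Runs of whitespace are collapsed to one space so that n-grams can
    span word boundaries. The default n=5 is an assumption.
    """
    normalized = " ".join(text.lower().split())
    return [normalized[i:i + n] for i in range(len(normalized) - n + 1)]


def index_terms(text, n=5):
    """Count both term types, tagged so a word never collides with an n-gram."""
    counts = Counter(("word", w) for w in word_terms(text))
    counts.update(("ngram", g) for g in char_ngrams(text, n))
    return counts


terms = index_terms("Indexing using both n-grams and words")
```

Tagging each term with its type keeps the two vocabularies separate in a single index, so the rest of a retrieval pipeline can stay fixed while the term type is varied.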
Similar Resources
Comparison of Word-Based and Syllable-Based Retrieval for Tibetan
Tibetan retrieval based on automatically segmented words is compared with the use of overlapping syllable n-grams using a known-item retrieval evaluation. The optimal span of fixed-length n-grams is found to be 2 syllables, and indexing words is found to be as effective as indexing syllable bigrams.
Improving KNN Arabic Text Classification with N-Grams Based Document Indexing
Text classification is the task of assigning a document to one or more predefined categories based on its contents. This paper presents the results of classifying Arabic-language documents with the KNN classifier, once using n-grams (namely unigrams and bigrams) for document indexing, and once using the traditional single-term indexing method (bag of words), which supposes...
A Polyphonic Music Retrieval System Using N-Grams
This paper describes the development of a polyphonic music retrieval system with the n-gram approach. Musical n-grams are constructed from polyphonic musical performances in MIDI using the pitch and rhythm dimensions of music. These are encoded using text characters enabling the musical words generated to be indexed with existing text search engines. The Lemur Toolkit was adapted for the develo...
Finding the Better Indexing Units for Chinese Information Retrieval
In the processing of Chinese documents and queries in information retrieval (IR), one has to identify the units that are used as indexes. Words and n-grams have been used as indexes in several previous studies, which showed that both kinds of indexes lead to comparable IR performance. In this study, we carried out more experiments to find the better way to index Chinese texts. First, we investi...
A comparison of sub-word indexing methods for information retrieval
This paper compares different methods of subword indexing and their performance on the English and German domain-specific document collection of the Cross-language Evaluation Forum (CLEF). Four major methods to index sub-words are investigated and compared to indexing stems: 1) sequences of vowels and consonants, 2) a dictionary-based approach for decompounding, 3) overlapping character n-grams...